perm filename PORP1[7,ALS] blob sn#032374 filedate 1973-04-04 generic text, type T, neo UTF8
00010								April 3 1973
00020	
00030	 A Proposal for Speech Understanding Research
00040	
00050	
00060		It is proposed that the work on speech recognition that is
00070	now under way in the A.I. project at Stanford University be continued
00080	and extended as a separate project with broadened aims in the field
00090	of speech understanding. This work gives considerable promise both of
00100	solving some of the immediate problems that beset speech
00110	understanding research and of providing a basis for future advances.
00120	
00130		It is further proposed that this work be more closely tied to
00140	the ARPA Speech Understanding Research groups than it has been in the
00150	past and that it have as its express aim the study and application to
00160	speech recognition of a machine learning process that has proved
00170	highly successful in another application and that has already been
00180	tested out to a limited extent in speech recognition. The machine
00190	learning process offers both an automatic training scheme and the
00200	inherent ability of the system to adapt to various speakers and
00210	dialects. Speech recognition via machine learning represents a global
00220	approach to the speech recognition problem and can be incorporated
00230	into a wide class of limited vocabulary systems.
00240	
00250		Finally we would propose accepting responsibility for keeping
00260	other ARPA projects supplied with operating versions of the best
00270	current programs that we have developed. The availability of the high
00280	quality front end that the signature table approach provides would 
00290	enable designers of the various over-all systems
00300	to test the relative performance of the top-down portions of their
00310	systems without having to make allowances for the deficiencies
00320	of their currently available front ends. Indeed, if the signature table
00330	scheme can be made simple enough to compete on a time basis (and we
00340	believe that it can) then it may replace the other front end
00350	schemes that are currently in favor.
00360	
00370		Stanford University is well suited as the site for such work,
00380	having both the facilities for this work and a staff of people with
00390	experience and interest in machine learning, phonetic analysis, and
00400	digital signal processing.
00410	
00420		Ultimately we would
00430	like to have a system capable of understanding speech from an
00440	unlimited domain of discourse and with unknown speakers. It seems not
00450	unreasonable to expect the system to deal with this situation very
00460	much as people do when they adapt their understanding processes to
00470	the speaker's idiosyncrasies during the conversation. The signature table
00480	method gives promise of contributing toward the solution of this
00490	problem as well as being a
00500	possible answer to some of the more immediate problems.
00510	
00520		The initial thrust of the proposed work would be toward the
00530	development of adaptive learning techniques, using the signature
00540	table method and some more recent variants and extensions of this
00550	basic procedure. We have already demonstrated the usefulness of this
00560	method for the initial assignment of significant features to the
00570	acoustic signals. One of the next steps will be to extend the method
00580	to include acoustic-phonetic probabilities in the decision process.
00590	Ultimately we would hope to take account of syntactic and semantic
00600	constraints in a somewhat analogous fashion.
00610	
00620		Still another aspect to be studied would be the amount of
00630	preprocessing that should be done and the desired balance between
00640	bottom-up and top-down approaches. It is fairly obvious that
00650	decisions of this sort should ideally be made dynamically depending
00660	upon the familiarity of the system with the current domain of
00670	discourse and with the characteristics of the current speaker.
00680	Compromises will undoubtedly have to be made in any immediately
00690	realizable system but we should understand better than we now do the
00700	limitations on the system that such compromises impose.
00710	
00720		It may be well at this point to describe the general
00730	philosophy that has been followed in the work that is currently under
00740	way and the results that have been achieved to date. We have been
00750	studying elements of a speech recognition system that is not
00760	dependent upon the use of a limited vocabulary and that can recognize
00770	continuous speech by a number of different speakers.
00780	
00790		Such a system should be able to function successfully either
00800	without any previous training for the specific speaker in question or
00810	after a short training session in which the speaker would be asked to
00820	repeat certain phrases designed to train the system on those phonetic
00830	utterances that seemed to depart from the previously learned norm. In
00840	either case it is believed that some automatic or semi-automatic
00850	training system should be employed to acquire the data that is used
00860	for the identification of the phonetic information in the speech. We
00870	believe that this can best be done by employing a modification of the
00880	signature table scheme previously described. A brief review of this
00890	earlier form of signature table is given in Appendix 1.
00900	
00910		The over-all system is envisioned as one that uses the more or
00920	less conventional method of separating the input speech into short
00930	time slices, on each of which some sort of frequency analysis
00940	(homomorphic, LPC, or the like) is done. We then interpret this
00950	information in terms of significant features by means of a set of
00960	signature tables. At this point we define longer sections of the
00970	speech called EVENTS which are obtained by grouping together varying
00980	numbers of the original slices on the basis of their similarity. This
00990	then takes the place of other forms of initial segmentation. Having
01000	identified a series of EVENTS in this way we next use another set of
01010	signature tables to extract information from the sequence of events
01020	and combine it with a limited amount of syntactic and semantic
01030	information to define a sequence of phonemes.
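
The grouping of slices into EVENTS can be illustrated with a short sketch. It is not the procedure used in the current system: the function name `group_into_events`, the threshold `max_diff`, and the choice of a city-block distance to the first slice of the event are all assumptions made for the illustration, and each slice is assumed to have been reduced already to a short vector of quantized feature values.

```python
from typing import List, Sequence, Tuple


def group_into_events(
    slices: Sequence[Sequence[int]], max_diff: int = 1
) -> List[Tuple[int, int]]:
    """Group consecutive, similar time slices into events.

    Returns (start, end) index pairs with `end` exclusive. A slice joins the
    current event while its city-block distance to the event's first slice
    stays within `max_diff` (a hypothetical similarity rule).
    """
    events: List[Tuple[int, int]] = []
    if not slices:
        return events
    start = 0
    for i in range(1, len(slices)):
        diff = sum(abs(a - b) for a, b in zip(slices[i], slices[start]))
        if diff > max_diff:
            # The slice differs too much: close the event and open a new one.
            events.append((start, i))
            start = i
    events.append((start, len(slices)))
    return events


# Six slices of two quantized features each (made-up values).
example = [(0, 0), (0, 1), (3, 3), (3, 3), (3, 4), (0, 0)]
boundaries = group_into_events(example, max_diff=1)
```

Events produced in this way vary in length with the signal, which is the property that allows them to take the place of a separate segmentation step.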
01040	
01050		While it would be possible to extend this bottom up approach
01060	still further, it seems reasonable to break off at this point and
01070	revert to a top down approach from here on. The real difference in
01080	the overall system would then be that the top down analysis would
01090	deal with the outputs from the signature table section as its
01100	primitives rather than with the outputs from the initial measurements
01110	either in the time domain or in the frequency domain. In the case of
01120	inconsistencies the system could either refer to the second choices
01130	retained within the signature tables or if need be could always go
01140	clear back to the input parameters. The decision as to how far to
01150	carry the initial bottom up analysis must depend upon the relative
01160	cost of this analysis both in complexity and processing time and the
01170	certainty with which it can be performed as compared with the costs
01180	associated with the rest of the analysis and the certainty with which
01190	it can be performed, taking due notice of the costs in time of
01200	recovering from false starts.
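
The recourse to second choices might be sketched as follows. The sketch assumes that the bottom-up section delivers, for each position, a list of candidate labels in order of preference, and that the top-down section can be represented by a predicate that accepts or rejects a whole sequence. The names `first_consistent`, `is_consistent`, and `depth` are hypothetical, and the toy lexicon stands in for whatever syntactic and semantic tests a real system would apply.

```python
from itertools import product
from typing import Callable, Optional, Sequence, Tuple


def first_consistent(
    ranked_choices: Sequence[Sequence[str]],
    is_consistent: Callable[[Tuple[str, ...]], bool],
    depth: int = 2,
) -> Optional[Tuple[str, ...]]:
    """Return the best-ranked label sequence that the top-down test accepts.

    Only the top `depth` choices at each position are considered. Candidate
    sequences are tried in order of total rank, so the sequence made up of
    all first choices is always tried first.
    """
    options = [list(enumerate(choices[:depth])) for choices in ranked_choices]
    candidates = sorted(
        product(*options), key=lambda combo: sum(rank for rank, _ in combo)
    )
    for combo in candidates:
        sequence = tuple(label for _, label in combo)
        if is_consistent(sequence):
            return sequence
    return None  # the caller would have to go back to the input parameters


# Made-up ranked outputs for three positions and a one-word toy lexicon.
ranked = [["k", "g"], ["ae", "eh"], ["t", "d"]]
lexicon = {("g", "ae", "t")}
best = first_consistent(ranked, lambda seq: seq in lexicon)
```

Here the first choices ("k", "ae", "t") are rejected and a single substitution of a second choice recovers an acceptable sequence. A return value of `None` corresponds to the case in which the system must go clear back to the input parameters.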
01210	
01220		Signature tables can be used to perform four essential
01230	functions that are required in the automatic recognition of speech.
01240	These functions are: (1) the elimination of superfluous and
01250	redundant information from the acoustic input stream, (2) the
01260	transformation of the remaining information from one coordinate
01270	system to a more phonetically meaningful coordinate system, (3) the
01280	mixing of acoustically derived data with syntactic, semantic and
01290	linguistic information to obtain the desired recognition, and (4) the
01300	introduction of a learning mechanism.
01310	
01320		The following three advantages emerge from this method of
01330	training and evaluation.
01340		1) Essentially arbitrary inter-relationships between the
01350	input terms are taken into account by any one table. The only loss of
01360	accuracy is in the quantization.
01370		2) The training is a very simple process of accumulating
01380	counts. The training samples are introduced sequentially, and hence
01390	simultaneous storage of all the samples is not required.
01400		3) The process linearizes the storage requirements in the
01410	parameter space.
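
The first two points can be illustrated with a minimal sketch of a single table. It should not be read as the table form described in the appendices: the class name `SignatureTable`, the uniform quantization of inputs assumed to lie between 0 and 1, and the use of one count per category in each cell are assumptions made for the illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


class SignatureTable:
    """One table addressed by the quantized values of its inputs."""

    def __init__(self, levels: Sequence[int]) -> None:
        self.levels = list(levels)  # quantization levels for each input
        self.cells: Dict[Tuple[int, ...], Counter] = {}

    def _signature(self, inputs: Sequence[float]) -> Tuple[int, ...]:
        # Quantization is the only place where accuracy is lost.
        return tuple(
            min(max(int(x * n), 0), n - 1) for x, n in zip(inputs, self.levels)
        )

    def train(self, inputs: Sequence[float], category: str) -> None:
        # Training is just counting; samples are seen one at a time and
        # need not be stored.
        cell = self.cells.setdefault(self._signature(inputs), Counter())
        cell[category] += 1

    def ranked(self, inputs: Sequence[float]) -> List[str]:
        # Categories in order of decreasing count: first choice, second, ...
        cell = self.cells.get(self._signature(inputs), Counter())
        return [category for category, _ in cell.most_common()]


table = SignatureTable(levels=[3, 3])
for sample, label in [((0.1, 0.9), "a"), ((0.2, 0.8), "a"), ((0.1, 0.8), "b")]:
    table.train(sample, label)
choices = table.ranked((0.2, 0.9))
```

Because the whole combination of quantized inputs addresses the cell, any interdependence among the inputs is captured without being specified in advance. The ranked output also retains the second choices that a later stage may want to consult.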
01420	
01430		The signature tables, as used in speech recognition, must be
01440	particularized to allow for the multi-category nature of the output.
01450	Several forms of tables have been investigated. Details of the current
01460	system are given in Appendix 2. Some results are summarized in an
01470	attached report.
01480	
01490		Work is currently under way on a major refinement of the
01500	signature table approach which adopts a somewhat more rigorous
01510	procedure. Preliminary results with this scheme indicate that a
01520	substantial improvement has been achieved.